A segmental framework for fully-unsupervised large-vocabulary speech recognition
Authors
Abstract
Zero-resource speech technology is a growing research area that aims to develop methods for speech processing in the absence of transcriptions, lexicons, or language modelling text. Early systems focused on identifying isolated recurring terms in a corpus, while more recent full-coverage systems attempt to completely segment and cluster the audio into word-like units, effectively performing unsupervised speech recognition. To our knowledge, this article presents the first such system evaluated on large-vocabulary multi-speaker data. The system uses a Bayesian modelling framework with segmental word representations: each word segment is represented as a fixed-dimensional acoustic embedding obtained by mapping the sequence of feature frames to a single embedding vector. We compare our system on English and Xitsonga datasets to state-of-the-art baselines, using a variety of measures including word error rate (obtained by mapping the unsupervised output to ground truth transcriptions). We show that by imposing a consistent top-down segmentation while also using bottom-up knowledge from detected syllable boundaries, both single-speaker and multi-speaker versions of our system outperform a purely bottom-up single-speaker syllable-based approach. We also show that the discovered clusters can be made less speaker- and gender-specific by using an unsupervised autoencoder-like feature extractor to learn better frame-level features (prior to embedding). Our system’s discovered clusters are still less pure than those of two multi-speaker term discovery systems, but provide far greater coverage.
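The abstract does not say how a variable-length sequence of feature frames is mapped to a single fixed-dimensional vector. As a minimal sketch only, assuming uniform downsampling (one simple possibility, not necessarily the method used in the article), the idea can be illustrated as follows. The function name `downsample_embedding`, the parameter `n_samples`, and the toy segments are all hypothetical.

```python
from typing import List

Frame = List[float]


def downsample_embedding(frames: List[Frame], n_samples: int = 10) -> List[float]:
    """Map a variable-length frame sequence to a fixed-length vector.

    Assumption for illustration: pick n_samples frames at evenly spaced
    positions and concatenate them. The output length is always
    n_samples * frame_dimension, whatever the segment duration.
    """
    if not frames:
        raise ValueError("segment must contain at least one frame")
    last = len(frames) - 1
    embedding: List[float] = []
    for k in range(n_samples):
        # Evenly spaced index in [0, last]; the nearest frame is used,
        # with no interpolation between neighbouring frames.
        idx = round(k * last / (n_samples - 1)) if n_samples > 1 else last // 2
        embedding.extend(frames[idx])
    return embedding


# Two hypothetical segments of different durations, 3-dimensional features.
short_segment = [[float(t), 0.0, 1.0] for t in range(7)]
long_segment = [[float(t), 0.0, 1.0] for t in range(40)]

emb_short = downsample_embedding(short_segment, n_samples=5)
emb_long = downsample_embedding(long_segment, n_samples=5)
```

Because both embeddings have the same length, segments of different durations can be compared or clustered directly in one vector space, which is what a segmental model of this kind relies on.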
Similar resources
Fully unsupervised small-vocabulary speech recognition using a segmental Bayesian model
Current supervised speech technology relies heavily on transcribed speech and pronunciation dictionaries. In settings where unlabelled speech data alone is available, unsupervised methods are required to discover categorical linguistic structure directly from the audio. We present a novel Bayesian model which segments unlabelled input speech into word-like units, resulting in a complete unsuper...
A segmental approach to text-independent speaker verification
Current text-independent speaker verification systems are usually based on modeling globally the probability density function (PDF) of the speaker feature vectors. In this paper, segmental approaches to text-independent speaker verification are discussed. Unlike the schemes based on Large Vocabulary Continuous Speech Recognition (LVCSR) with previously trained phone models, our systems are based ...
Unsupervised Language Model Adaptation for Lecture Speech Recognition
This paper addresses speaker adaptation of language model in large vocabulary spontaneous speech recognition. In spontaneous speech, the expression and pronunciation of words vary a lot depending on the speaker and topic. Therefore, we present unsupervised methods of language model adaptation to a specific speaker by (1) making direct use of the initial recognition result for generating an enha...
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...
SCARF: A Segmental CRF Speech Recognition System
We propose a theoretical framework for doing speech recognition with segmental conditional random fields, and describe the implementation of a toolkit for experimenting with these models. This framework allows users to easily incorporate multiple detector streams into a discriminatively trained direct model for large vocabulary continuous speech recognition. The detector streams can operate at ...
Journal title:
- Computer Speech & Language
Volume 46, Issue
Pages -
Publication date 2017